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Abstract 

We derive exact formulae for the allele frequency spectrum under the coalescent 
with mutation, conditioned on allele counts at some fixed time in the past. We 
consider unlinked biallelic markers mutating according to a finite sites, or infi- 
nite sites, model. This work extends the coalescent theory of unlinked biallelic 
markers, enabling fast computations of allele frequency spectra in multiple pop- 
ulations. Our results have applications to demographic inference, species tree 
inference, and the analysis of genetic variation in closely related species more 
generally. 
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1. Introduction 



A key insight of coalescent theory is that for neutral markers the process 
describing the genealogical relationships between individuals in the sample can 
be separated from the process describing the accumulation of mutations. This 
separation makes the coalescent a powerful tool for modelling and inference, 
particularly in combination with m odern Mont e -Carlo methods for inference 
in population genetics (reviewed in iFelsensteinI (|2004l ): iMarioram and Tavare 



m population ge netics [: 
( 2006fl : IWakeievl (120091) '). 
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In this paper we bring the genealogical and mutation processes back together 
again. We derive analytical expressions for allele frequency spectra given by 
the coalescent process with finite-sites and infinite-sites models of mutation 
conditioned on the number of lineages, and allele frequencies, at some fixed 
time in the past. Our results apply to unlinked biallelic markers. 

Analytical formulae for frequency spectra in single populations have been 
known and widely used for some time. These have been derived for the infinitely- 
many- a l leles model and the finite-sites model using diffusion approximations 



(Ewens 



197: 



rence ( Tavarg 



2004 ) and for the i nfinite-sites model by solving an exact recur- 



1984t lEwensl l2004l) 



Here we derive formulae for the frequency spectra conditioned on allele and 
lineage counts in the past. We assume the standard coalescent process operating 
on a neutral unlinked biallelic marker, with either the finite-sites or infinite-sites 
model of mutation. The conditional frequency spectrum is then given by 



Pr[ro = r|no n, n^- 



rr] 



where Tq is the number of lineages carrying the derived allele at the present, 
no is number of lineages sampled at the present, is the number of distinct 
ancestral lineages at some time r in the past, and r,- is the number of these 
ancestral lineages with the derived allele. 

If we exclude the possib ility of mutatio n then allele frequencies change only 
through coalescence events. ISlatkinI (|l996l ) showed that the frequencies followed 
an urn model and derived the a closed form expression for the conditional fre- 
quency spectrum: 



Pr[ro = r|no = n, = n^, = r^] = 



/ r— 1 \ / n—r—1 \ 
\r-T- — l/ \n-r—r-r — l/ 



(1) 



The importance of these conditional probability formulae lies in their use 
for computing freq uency spectra or likelihoods across multiple populations. 



Nielsen et al.l (|l998l ) showed how ([T]) could be used to m aximum likelihood es 



timates of divergence times and population parameters. iRovChoudhury et al 
(j2008h used a dynamic programming framework to compute exact likelihoods 
for entire population trees 



The methods of both lNielsen et al.l (|l998l ) and iRovChoudhurv et all (|2008l ) 
assume models which exclude the possibility of mutation in all populations 
except the root population. In many contexts, assuming a zero mutation rate 
is completely appropriate. One important example is the analysis of single 
nucleotide polymorphism (SNP) data from closely related populations. However 
other markers have higher mutation rates, and SNP data can be used to analyse 
individuals from more divergent populations or even different species. In both 
cases, assuming a zero mutation rate is unrealistic. 

Our expressions for the conditional frequency spectrum can be viewed as an 
extension of Slatkin's formula ([1]) to handle finite-sites or infinite-sites muta- 
tion. Our formulae can be incorporated into the same dynamic programming 



2 



framework as that developed by iRovChoudhurv et al. I (l2008l) . and used to effi 



ciently compute exact likelihoods, with mutation, across multiple populations 
or species. In a companion papei0 we apply these formulae within a likeli- 
hood algorithm which is used to species trees and population parameters. The 
algorithm is fast enough to infer species trees and parameters for dozens of 
individuals using hundreds of thousands of markers. 

2. Setting the scene 

Consider n individuals sampled from a Wright-Fisher population. Under 
the neutral coalescent model, the number of distinct ancestral lineages at some 
time t in the past follows a pure death process. The rate, backwards in time, 
for going from k distinct ancestral lineages to fc — 1 lineages is k{k — l)/2. We 
assume the standard rescaling of time in terms of effective population size so 
that one unit of time corresponds to 2Ne generations. 

Consider some time r before the present. For any t such that < i < t, 
let lit denote the number of distinct lineages ancestral to the sample at time t. 
The conditional dis tribution of n^ given ng was obtained by iTavara (,198^ (see 
also lGriffithd (llQSOl) ): 



Pr[n, = Hno = n] = ^ ^-^(^-^2 (2^ - Ijl-D'^-^C.-D^W ^ 



k—m 



m\{k — TO)!n(-j,-) 



where — n(n — l){n — 2) • • • (n — fc + 1) and ri(^.-) = n(n + 1) • • • (n + fc — 1). 

We consider a locus with two alleles, which we label red and green. For any 
t with < i < T, let rt denote the number of lineages that carry the red allele 
at time t before the present. Hence < rt < n^. For example, in Figure [1] we 
have no ~ 6, ~ 3, n^ = 4, rt = 1, n,- = 3 and r^ = 1. 

The neutral coalescent model with mutation follows a two-step process. 
First, a tree is generated for the sample under the coalescent process. This 
step is followed by the mutation process, in which nodes of the tree are desig- 
nated as red or green: 

• Under the infinite sites model conditional on a polymorphic site, a single 
mutation is placed uniformly at random along the tree. All descendants 
of the mutation receive the derived red allele, whereas the remainder of 
the tree is assigned the ancestral green allele. 

• Under the finite sites model, mutation along a lineage is modelled by a 
continuous time Markov chain with rate u of mutating from red to green 
and rate v of mutating from green to red. It is assumed that the mutation 



^Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N.A., RoyChoudhury, A. 2001 
Inferring Species Trees Directly from Biallelic Genetic Markers: Bypassing Gene Trees in a 
Full Coalescent Analysis. Submitted to Mol. Biol. Evol. ArXiv preprint 0910.4193 available 
at www.arxiv.org 
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t+h 
t 







Figure 1: An instance of a coaloscont process followed by finite-sites mutation. In this example, 
(no,ro) = (6,3), (nt,rt) = (4, 1), (nt+h,rt+h) = (3, 1) and inr,rr) = (3, 1). 



process is at stationarity along the ancestral lineage, so that the root of 
the tree is assigned a red allele with probability tt — -^^^ and a green allele 



with probability 1 — tt = . 



Tavarel ( 19841 ) explored a model combining the coalescent with an infinite 



alleles mutation model. Equation ^ above is a special case of a more general 
result of Tavare involving the distribution of the number of lines of descent. A 
line of descent is a subtree of the gene tree whose root is one individual among 
the ancestral lineages and which includes all descendants in the gene tree not 
separated from the root by a mutation. Looking backwards in time, the number 
of lines of descent follows the same process as a coalescent except that lineages 
ca n be 'removed' by mutation . 



Ethier and Griffiths! (1987) developed a forward-time process for the coales- 



cent with mutation for the infinite-sites model, acting on multiple linked sites. 
They obtained stationary pro babilities via a recurs i on, which can be evaluated 
using Monte-Carlo techniques ( Griffiths and Tavarel 1996 : Stephens and Donnellv 
[2000) and extended t o mul tiple species (|Nielsenl . 119981 ). An alternative recursion 
was developed by|^ (|l998 ) and used to compute a range of related probabilities. 
The frequency spectrum at a single segregating site satisfy 



Pr[r = r|n = n] cx 



(3) 



(see, for example, eq. 3.8 of Griffiths and Tavari (1998) or eq. 9.63 in EwensI 
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Consequently, the expressions for the frequ ency spectrum of polymor- 
phic s ites do not involve the populati on size 9 (see a lso lRovChoudhurv and Wakelev 
( 2010f )). an observation also made by Ewens (1972) for the infinitely- many-alleles 
mo del. 

Tian and Lin ( 20091 ) propose a model that combines the coalescent and mu- 



tation, but that bears only a superficial resemblance to the coalescent-with- 
mutation model studied here. Their 'colored coalescent' makes the key sim- 
plifying assumption that after a coalescence, the allelic state of the parent is 
independent of the state of the children. This assumption conveniently implies 
that mutations in different lineages are independent, but it is clearly not an 
appropriate proxy for the coalescent with mutation. 



3. Conditioned frequency spectra under a finite-sites mutation model 

In this section we assume the finite sites model of mutation and derive an 
expression for Pr[ro = r\nQ = rt, n.^ = n^, = Vr], the probability of observing 
r red alleles in a sample of n individuals taken from the present, conditional on 
those individuals having n^- ancestral lineages at time r in the past, r,- of which 
also carried the red allele. 

Fix Ur and r,- and define, for < i < r and < rt < nt, 

ft{nt,rt) = Pr[rt = rt\nt = nt,n^ = n^,r^ = r^] Pr[n^ = n^|nt = n*]. (4) 

We abbreviate this quantity to Pr[rt|nt, n,-, r^] Pr[n^|nt]. The goal is to deter- 
mine fo{no,ro), since Pr[n7.|nt] can be computed using 

Lemma 1. Suppose that t < t + h < t (see Figure]^. The following identities 
hold: 

i^i[nt+/i|nt, n^, r^j — , [p) 

-ri[nT-|ntj 

Pr[rt+ft|nt,nt+,i,n^,r^] = Pr[rt+,j |nt+ft, n^, r^], (6) 
Pr[rt|nt,nt+,i,rt+/,.,n^,r^] = Pr[rt|nt, nt+ft, rt+,i]. (7) 

Proof 

The combined coalescence and mutation process is simulated by first generating 
a coalescent tree in the direction of increasing and then evolving the mutation 
process in the opposite direction. As mutation is a forward process, we have 

Pr[r.r|nt+/i,nr] = Pr[rr|nr] 

from which we infer 

Pr[nt+,i|nt,n^,r^] = Pr[nt+,,|nt, n,-] 
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giving ([SJ. Once we condition on r,-, the state of rj+zj depends only on the 
coalescent events between t + h and r, and the associated mutation process. 
Hence 

Pr[rt+ft|nt+,,,n^,r^,nt] = Pr[ri+,i|nt+,i, n^, r^]. 

The final identity ([7]) follows from the mutation process being Markov, once the 
coalescent process is fixed. 



□ 

We marginalise over nt+h and rt+h and apply Lemma[T]to obtain an expres- 
sion for ft in terms of ft+h- 



ft{nt,rt) = Pr[rt|nt, n^, r^] Pr[n^|nt] 

''t + h 

= X! X! Pr[rt|nt,nt+;i,rt+ft,n^,r^]Pr[nt+/,,rt+,,|nt,n^,r^]Pr[n^|nt] 

nt+h=nt rt+h=0 
n-j- ^'t+Zi 

= ^ ^ Pr[rt|nt,nt+,,,rt+,i]Pr[rt+,,|nt+/i,nt,n^,r^]Pr[nt+,i|nt,n^]Pr[n^|nt] 

nt+h=nt rt+h,=0 

= X! X! Pr[i"t|nt,nt+,,,rt+/i]Pr[rt+,,|nt+/i,n^,r^]Pr[nt+,,|nt,n^]Pr[n^|nt] 

nt+h=nt rt+h=0 

= X! X! Pr[i"t|nt,nt+,,,rt+,i]Pr[rt+,,|n(+/i,n^,r^]Pr[n^|n(+/i]Pr[nt+,,|nt] 
= X! X! Pr[i"t|nt,nt+,,,rt+/i]Pr[nt+,,|nt]/t+,,(nt+/i,rt+,,). 

(8) 

We now find expressions for Pr[rt|nt, nt+h,Yt+h] and Pr[nt+h|nt]. 
From the coalescent model for neutral loci, 

[l)h + o{h) ifn' = n-l; 

Pr[nt+ft ^n'\nt^n]^ \ l- {;^)h + o{h) if n' = n; (9) 

o{h) otherwise. 

Assuming h is small, we ignore events with probability o{h) and hence only 
consider the cases of no coalescent events (n' = n) or one coalescent event 
{n' — n — 1) between time t and t + h. For each of these two cases, we consider 
what can happen to the allele counts with and without mutation. 
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Figure 2: Illustration of how the number of red (dashed line) or green (solid line) lineages can 
change at a coalescence. (A) The node at the coalescence is green, so the number of green 
lineages changes. (B) The coalescence is at a red node. 



First consider the case of no coalescent events. The probabihty that more 
than one mutation occurs between time t and time t -\- h is o{h), so we only 
consider values of r' obtained from r by or 1 mutation: r' = r — 1, r + 1, or r. 

For r' = r — 1, the probability that = r is the probability that one of the 
n — (r — 1) green lineages at time t + h mutated into a red lineage. Therefore 

Pr[rf = r|nf = rii+h = n,ri+;i = r - 1] = {n - {r - l))vh + o{h). (10) 

For r' — r + 1, conditional on lit = ^t+h — n, the probability that rt = r is the 
probability that one of the r + 1 red lineages at time t + h mutated into a green 
lineage. As a result, 

Pr[rt = r|nt = nt+,i n,rt+,i = r + 1] = {r + l)uh + o{h) . (11) 

For r' = r, the probability that rf = r is the probability that none of the r red 
lineages or n — r green lineages at time t + h mutated. Thus 

Pr[rt = r\nt = Ut+h = n, Vt+h = r] = 1 - (n - r)vh - ruh + o(h). (12) 

Now consider the second case, in which there is one coalescent event and 
lit-f-h — n — 1. The probability that both a coalescent event and a mutation 
event occur between time t and t + h ia o{h), so we ignore this possibility. 
The number of red lineages can change, depending on whether the node at the 
coalescent event is red or green. If it is red (Figure [2] (B)) the number of red 
lineages will increase from time t + h io time t. If it is green then the number 
of red lineages will remain the same. Hence 

Pr[rt = r\nt n, nt+u Vt+h = r - 1] 

Pr[rt = r\nt ^ n, Ut+h = n - 1, Vt+h = r] 

We now substitute ((TOl) — (IT4|) into ([8]), collecting products of quantities that 



+ 0(1), (13) 
-^+o(l). (14) 
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are o{h). This gives 

Mn, r) = ((n - (r - l))vh + o{h)) [\ " Q ^ + ft+hin, r - 1) 

+ ((r + + o(/i)) - (2) + ^ + 1) 

+ (1 - (n - r)i;;i - ruh + o(/i)) (^1 " Q + o{h)^ ft+h{n, r) 
" W 0(1)) (('!]h + oih)) ft+h{n - 1, r - 1) 



n-1 '7 VV2, 

+ ( ""n-T + ( (2) ^ ~ ^' 

+ o{h) (15) 
= ft+h{n, r - l)(n - r + l)w/i 
+ ft+h{n,r + l){r + l)uh 

+ ft+h{n, r) (1 - (^j h-(n- r)vh - ruh 

+ /,+^(n-l,r-l)^Q'^ + o{h). (16) 
Rearranging, dividing by h, and taking the limit as /i — >■ we obtain 
^/^(?^,^■) = -ft{n,r- l)(n -r + l)u - ft{n,r + l){r + l)u 

+ Mn, r) ( (^) + {n - r)v + ru 



n m 



Yl ^{n,ry,(m,q)ft{m, q). (17) 
m=l q=0 
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Here, Q is a matrix with rows and columns indexed by pairs {n,r), with 



Q{n,r);(n,r-i) = {n - r + l)v 0<r<n 

Q(n,r);(«,r+1) = {r + l)u 0<r <n 

Q(n,r);(n,r) = - - {n - r)v - ru 0<r <n (18) 
(n — 1 — r)n 

Q(n,r);(n-l,r) = ^ < r < n 

Q(n,r);(n-l,r-l) = ^ < r < n 

and all other entries zero. 

Note that /t(n, r) is bounded for all t < t, but might be undefined at t — t 
if n 7^ tIt-. When n Ur we have Pr[n^ = n^lnt = n] as t ^> r, so 
/t(n, r) — > 0. Hence 

, . . fl ifn = n^ and r = r^; 

lim /t(n,r) = <^ . (19) 

t-fT I otherwise. 

This result provides the boundary conditions for the differential equation ([T7)) . 
Solving and substituting t = 0, we obtain 

/o(n,r) = exp(QT)(„_r);(n^,r,)- (20) 
We have now established 
Theorem 1. Let Q be the matrix defined in (jl8p . Under the finite sites model, 



DTI 1 exp(QT)(„_r).(„^^r^) 

Pr[ro = r|no = n, = n^,r^ = ?>] = -^77 . r- (21) 

Pr [rir = Tir I no = rtj 

As a corollary to Theorem [1] we derive a new formula for the stationary 
probabilities Pr[ro|no], which represent the probability of observing Tq individ- 
uals carrying the red allele in a sample of size Uq. A closed form approximation 
for Pr[ro|no] can be obtained by way of a diffusion approximation. Under the 
diffusion model, the allele proportions in the entire population have an approx- 
imately beta-distribution and so Pr[ro|no] follows a beta-binomial distribution 
( Ewens . 1 2004 ). The distribution we derive here does not use an explicit diffusion 



approximation but is instead based solely on the assumptions of the coalescent 
model. It gives probabilities very close to a beta-binomial, though the prob- 
abilities are not exactly the same. The difference presumably stems from the 
slightly different model assumptions underlying the standard diffusion and coa- 
lescent models. 
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Consider a single population from which we have taken Uq = n samples. 

n rit 

Pr[ro r|no ^ n] = ^ ^ P^^t = f^t,I■t = rt\no = n] Pr[ro = r|no = n,nt 

nt — l rt— 

E" Pr[nt = nt,rt = rtluQ — n] 
„,=ir~t) Print = nt\no = n] 

n nt 

= X! X! ^^t*"* = |no = nt = nt] exp(Qi)(„_r).(„^_^j) 

Tit = l rt=0 
n 

nt = l rt=0 

We take the limit of the right hand side as, t — > 00. For this computation 
we examine the spectrum of Q. The structure of Q makes this quite straight- 
forward. 



A'" 










B'-' 


A'"" 
































B"" 


A"" 



Figure 3: The bidiagonal block structure of the matrix Q. 

Order the rows and columns of Q according to the pairs (n, r), in order (1, 0), 
(1,1), (2,0), (2,1), (2,2), (3,0),.... Then Q has block bidiagonal form as in 
Figure [3l For example, when 71 = 4 we have 
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Since Q is block triangular, the eigenvalues of Q are exactly the eigenval- 



= nt,rt = rt] 
(22) 
(23) 
(24) 



10 



ues of the diagonal blocks A^-^^ . . . , A^"'^ as can be seen by decomposing the 

characteristic polynomial det(Q - AI) into det(A(i) - AI) det(A(") - AI). 

Furthermore, each block A*^*) equals the rate matrix of a birth death process, 
with the value i{i — l)/2 subtracted from the diagonal. Hence A^'^ has s trictly 
negative eigenvalues when i > 1 (see e.g. Grimmett and Stirzaker ( 200l[ )). By 
inspection, A*^^) has one zero eigenvalue and one negative eigenvalue. Hence Q 
has one zero eigenvalue while the remaining eigenvalues are strictly negative. 
From the diagonalisation of Q we see that 



lim exp(Qi) = xy"^, 

t— >oo 



(25) 



where y is a left 0-eigenvector of Q and x-^ is a right 0-eigenvector of Q such 
that y^'^x = 1 . A left eigenvector is given by 

y= [1,1,0,0,...,0]^. 

The corresponding right eigenvector is found by solving the recurrence implied 
by Qx = 0, that is 



r(l) 



U 



V 



U + V U + V 



xW = 'i?«x('-i) 
Substituting into (j25|) we have 



2,3,., 



,n. 



lim exp(Qt)(„_r);( 



nt,rt) 




if nt = 1 
otherwise. 



Most of the summation terms in (j24p are zero, leaving 

Pr[ro = r|no = n] = lim (Pr[rt = 0|nf = 1] + Pr[rt = l|nt = l])x(„ 

^(Ti,r) ■ 



(26) 
(27) 

(28) 



(29) 
(30) 



We now have an exact formula for allele frequency spectrum for a biallelic marker 
under the coalescent. Since each matrix A'^'^ is tridiagonal, each equation 



z = 2,3,. 



takes only 0{n) time to compute ( Golub and van Loan . 1996[) , making O(n^) 
time to compute all of the probabilities Pr[ro = r|no — n]. 

Theorem 2. Let n and r be the number of lineages and the number of red 
lineages sampled from a single population of constant size. Let Q be the matrix 
defined in (|18p and let x be a non-zero solution for Qx — 0, scaled so that 
X(-x.o) + ^(1.1) — 1- Then for all n, r, Pr[r = r|n = n] = X(-„_j.). 
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4. Conditional frequency spectra under the infinite-sites mutation 
model 

We now consider the infinite-sites mutation model. Suppose that mutations 
accumulate along lineages at a constant rate fi and that, at time t in the past, 
all lineages carry the ancestral green allele. The allele distribution for a site 
under the infinite-sites model is then obtained by conditioning on there being 
exactly one mutation between time and time r. To this end, let Mt denote the 
event that at most one mutation has occurred, over all the lineages, between 
time T and time t. Apart from the inclusion of this additional event and some 
simplifications there is little difference between the analysis here for the infinite 
sites model and the earlier finite sites model analysis. 

Fix Ut and r-r and define, for < t < t, 

gt{n, r) = Pr[Mf , rj = r|nf = n, n^ = n^, r^ = r^-] Pr[nT- = nr\nt = n]. (31) 

This is simply the function ft{n,r) with the inclusion of the event Mj. We 
wish to determine gQ(n,r). The first step is to incorporate Mt into two of the 
identities in Lemma [T] 

Lemma 2. Suppose that t < t + h < t. Then the following identities hold: 

Pi-[Mt+h,rt+h\nt,nt+h,nr,rr] = Pr[Mt+/,., rf+ft|nt+,i, n^, r^], (32) 

PT[Mt,rt\nt,Mt+h,rt+h,nt+h, rr, rir] = Pr[Mt, rt |nf , Mt+/i, rt+/i, nt+h, r,-, n,-]. 

(33) 

Proof 

The combined coalescence and mutation process is simulated by first generating 
a coalescent tree in the direction of increasing t, and then evolving the mutation 
process in the opposite direction. Once we condition on r,-, the state of Mt+h 
and Tt+h depends only on the coalescent events between t + h and r, and the 
mutation process. Hence 

PT[Mt+h, rt+h\nt+h, rir, r^, nt] = Pr[Mf+ft, rt+h\nt+h, nr, rr]. 

Note that M,- holds by assumption. As in Lemma [TJ the final identity p3p 
follows from the mutation process being Markov once the coalescent process is 
fixed. 



□ 

We marginalise over nt+h and rt+h and apply (|32l) and (|33|) to give an 
expression for gt in terms of gt+h- This differs from the finite sites case only by 
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the inclusion of the events Mt. 



gt{nt,rt) = Pr[Aft,rt|nt,n^,r^] Pr[n^|nt] 

= ^ ^ PT[Mt,rt\nt,Mt+h,nt+h,rt+h,nr,rr]Pr[Mt+h,nt+h,rt+h\'nt,nr,rr]PT[nr\nt] 

nt-\-h = i rt + h=n 

= ^ ^ Pr[Afi,rf|nt, A//t+,i,nt+,i,rf+,i] Pr[Aff+,i,rf+,Jnf+,i,n^,r^] Pr[nf+,i|nt,n^] Pr[n 

7lt nt + h 

= ^ ^ Pr[Aft,rf|nt,A/t+,i,nt+,i,rf+,i]Pr[Mf+,i,rf+,Jnf+,i,n^,r^]Pr[n^|n4+h]Pr[nf+,i 

= X! X! VT[Mt,Yt\nt,Mt+h,T^t+h,'Ct+h]PT:[T^t+h\T^t]gt+h{nt+h,rt+h)- 

(34) 

As before, many terms in the final summation of p4p are o{h). We consider 
the exceptions. 



(i) a coalescent event, where the parent is red 

r - I 

Pr[Mt, rt = r\nt = n, Mt+n, ^t+h =n-l, Tt+n = r - 1] = + o(l); 

n — 1 

(ii) a coalescent event where the parent is green 

Tl — 1 — T 

Pr[Mf , rt = r|nt = n, Mt+h,'^t+h = n - 1, Vt+h = A = h o(l); 

n — 1 

(iii) the case of the mutation ancestral to all red lineages (r = 1) 

Pr[Mt, rt = l|nt = n, Mt+h, ^t+h = 0, Ut+h = n]= n^h + o{h); 

(iv) the case of no coalescent events and no mutation events 

Pr[M(, Vt = r|n( = n, Mt+h, ^t+h = r, xit+h = n] = 1 - n/i/i + o{h) 

The main difference with the finite sites model is that having Mj hold implies at 
most one mutation event occurring between time t and r. This event is handled 
in case (iii). 
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Substitute cases (i) — (iv) into ([Ml) to obtain 

oil)] (('']h + oih))g,+^{n~l,r) 



n-1 '7 VV2^ 

+ S{r,l) [nuh + o{h)) gt+h{n,{)) 

+ (1 - nuh + o{h)) - Q + o{h)] gt+h{n, r), 

where gt{n, r) and gt+h{n, r) are zero unless < ?' < n and 1 < n. 
Rearranging and taking the limit as h — > we obtain 

4:9t{n,r) = - ^ R{n,r)-{n'r')gt{n,r'), 



dt 

where R is the matrix given by 



R-(n.O);(n,l) = < U 

(n — 1 — r)n 

R(ri,r);(n-l,r) = = < T < n 



2 

(r — l)n 



R-(n,r);(«-l,r-l) = ^ <r <n 



2 



R(n,r);(n,r) = -(„|-'^U <r <n 



and all other entries 0. 

As in the finite sites case we have the boundary conditions 

J 1 \i n = Ur and r = r^; 
lim gt{n, r) — < 

I otherwise. 

Solving for r = gives 

go{n,r) = exp(RT)(„,r);(n^,rO 



and so 



Pr[Mo,ro = r|nQ = n, = n^,r^ = r^] = /o(n,r) 

_ exp(RT)(„^r);(n^,r^) 



Pr[nT- = nr|no = no] ' 

Normalising now gives 
Theorem 3. Let R be the matrix defined in (j36p . Under the infinite 
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model, at a segregating site 



Pr ro = r no = n, = n^, = = ^^^^j — — . (41) 

Lr' = iexp(RT)(„^^,).(„^_0) 

for < r < n. 

We note that if r,- > then the distribution of rg is determined solely by 
the coalescent events, and 

/ r—l \ / n—r—1 \ 

Pr[ro = r|no = n, = n^, = r^J = — ^ , (42) 



see lSlatkinI (|l996[) . 



5. Conclusion 



We have derived exact formula for coalescent-based likelihoods from un- 



linked biallelic markers. Our results generalise urn model results of iSlatkin 
(Hii) by incorporating mutation, under both a fi nite sites or infinite sites 
model. As a consequ e nce, t he methods developed bv iNielsen et al. ( 1998 ) and 
RovChoudhurv et al. ( 20081 ) for analysing SNP data from single and multiple 



populations can also be extended to include mutation. This will have less of an 
impact for single population analyses, but will be of increasing importance in 
analyses of multiple, closely related, species. 

One consequence of our results which will have immense practical significance 
is that coalescent likelihood calculations can be carried out without the need 
to integrate over gene trees. This is critical if coalescent analyses are to go 
beyond the analyses of small numbers of genes. Indeed, the formulae derived 
here enable a full coalescent analysis of hundreds of thousands of unlinked SNP 
loci. 
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